
    SqueezeLLM: Dense-and-Sparse Quantization

    Generative Large Language Models (LLMs) have demonstrated remarkable results for a wide range of tasks. However, deploying these models for inference has been a significant challenge due to their unprecedented resource requirements. This has forced existing deployment frameworks to use multi-GPU inference pipelines, which are often complex and costly, or to use smaller and less performant models. In this work, we demonstrate that the main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, specifically for single batch inference. While quantization has emerged as a promising solution by representing model weights with reduced precision, previous efforts have often resulted in notable performance degradation. To address this, we introduce SqueezeLLM, a post-training quantization framework that not only enables lossless compression to ultra-low precisions of up to 3-bit, but also achieves higher quantization performance under the same memory constraint. Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format. When applied to the LLaMA models, our 3-bit quantization significantly reduces the perplexity gap from the FP16 baseline by up to 2.1x as compared to the state-of-the-art methods with the same memory requirement. Furthermore, when deployed on an A6000 GPU, our quantized models achieve up to 2.3x speedup compared to the baseline. Our code is open-sourced and available online.
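    A minimal NumPy/SciPy sketch of the Dense-and-Sparse idea, under simplifying assumptions: outliers are picked here by a plain magnitude threshold (the paper instead uses second-order sensitivity information), and the non-uniform codebook is fit with ordinary 1-D k-means rather than the sensitivity-weighted variant. The function name and parameters are illustrative, not the released API.

    import numpy as np
    from scipy import sparse

    def dense_and_sparse(W, outlier_pct=0.5, n_bits=3):
        """Split W into exact sparse outliers plus a dense part quantized
        to a 2**n_bits-entry non-uniform codebook (1-D k-means)."""
        thresh = np.percentile(np.abs(W), 100 - outlier_pct)
        mask = np.abs(W) >= thresh
        S = sparse.csr_matrix(np.where(mask, W, 0.0))   # outliers kept at full precision
        D = np.where(mask, 0.0, W)                      # dense remainder to be quantized

        vals = D.ravel()
        centroids = np.quantile(vals, np.linspace(0.0, 1.0, 2 ** n_bits))
        for _ in range(10):                             # Lloyd iterations
            idx = np.abs(vals[:, None] - centroids[None, :]).argmin(axis=1)
            for k in range(centroids.size):
                if np.any(idx == k):
                    centroids[k] = vals[idx == k].mean()
        Dq = centroids[idx].reshape(D.shape)            # dequantized dense part
        return S, Dq

    # S @ x + Dq @ x then approximates the original W @ x.
    W = np.random.randn(256, 256).astype(np.float32)
    S, Dq = dense_and_sparse(W)
    x = np.random.randn(256).astype(np.float32)
    print(np.linalg.norm(W @ x - (S @ x + Dq @ x)) / np.linalg.norm(W @ x))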

    Property-Aware Multi-Speaker Data Simulation: A Probabilistic Modelling Technique for Synthetic Data Generation

    We introduce a multi-speaker speech data simulator engineered to generate realistic multi-speaker speech recordings. A notable feature of this simulator is its capacity to modulate the distribution of silence and overlap via the adjustment of statistical parameters. This capability offers a tailored training environment for developing neural models suited for speaker diarization and voice activity detection. The acquisition of substantial datasets for speaker diarization often presents a significant challenge, particularly in multi-speaker scenarios. Furthermore, precise time-stamp annotation of speech data is a critical factor for training both speaker diarization and voice activity detection models. Our proposed multi-speaker simulator tackles these problems by generating large-scale audio mixtures whose statistical properties closely align with the input parameters derived from real-world statistics. Additionally, we present the effectiveness of speaker diarization and voice activity detection models trained exclusively on the generated simulated datasets.
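    A minimal sketch of the timeline-construction idea described above, assuming NumPy; the parameter names (mean_silence, overlap_prob, mean_overlap) and the exponential gap distributions are illustrative stand-ins for the simulator's actual statistical parameters.

    import numpy as np

    def simulate_session(utterances, sr=16000, mean_silence=0.3,
                         overlap_prob=0.2, mean_overlap=0.4, seed=0):
        """Place (speaker, waveform) utterances on a shared timeline, drawing
        inter-utterance silence, or occasional overlap with the previous
        utterance, from exponential distributions. Returns the mixture and
        (start_sec, end_sec, speaker) annotations."""
        rng = np.random.default_rng(seed)
        placements, cursor = [], 0
        for speaker, wav in utterances:
            if placements and rng.random() < overlap_prob:
                # Step back so this utterance overlaps the previous one.
                cursor = max(0, cursor - int(rng.exponential(mean_overlap) * sr))
            else:
                # Insert a silence gap before this utterance.
                cursor += int(rng.exponential(mean_silence) * sr)
            placements.append((speaker, cursor, wav))
            cursor += len(wav)
        total = max(start + len(wav) for _, start, wav in placements)
        mix = np.zeros(total, dtype=np.float32)
        annotations = []
        for speaker, start, wav in placements:
            mix[start:start + len(wav)] += wav
            annotations.append((start / sr, (start + len(wav)) / sr, speaker))
        return mix, annotations

    Tuning mean_silence and overlap_prob directly shifts the silence and overlap statistics of the generated mixtures, which is the property-aware control the abstract describes.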

    Testing Lorentz Invariance with Neutrinos from Ultrahigh Energy Cosmic Ray Interactions

    We have previously shown that a very small amount of Lorentz invariance violation (LIV), which suppresses photomeson interactions of ultrahigh energy cosmic rays (UHECRs) with cosmic background radiation (CBR) photons, can produce a spectrum of cosmic rays consistent with that currently observed by the Pierre Auger Observatory (PAO) and HiRes experiments. Here, we calculate the corresponding flux of high energy neutrinos generated by the propagation of UHECR protons through the CBR in the presence of LIV. We find that LIV produces a reduction in the flux of the highest energy neutrinos and a reduction in the energy of the peak of the neutrino energy flux spectrum, both depending on the strength of the LIV. Thus, observations of the UHE neutrino spectrum provide a clear test for the existence and amount of LIV at the highest energies. We further discuss the ability of current and proposed future detectors to make such observations.
    Comment: final version to appear in Astroparticle Physics

    Full Stack Optimization of Transformer Inference: a Survey

    Recent advances in state-of-the-art DNN architecture design have been moving toward Transformer models. These models achieve superior accuracy across a wide range of applications. This trend has been consistent over the past several years since Transformer models were originally introduced. However, the amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate, and this has made their deployment in latency-sensitive applications challenging. As such, there has been an increased focus on making Transformer models more efficient, with methods that range from changing the architecture design, all the way to developing dedicated domain-specific accelerators. In this work, we survey different approaches for efficient Transformer inference, including: (i) analysis and profiling of the bottlenecks in existing Transformer architectures and their similarities and differences with previous convolutional models; (ii) implications of the Transformer architecture for hardware, including the impact of non-linear operations such as Layer Normalization, Softmax, and GELU, as well as linear operations, on hardware design; (iii) approaches for optimizing a fixed Transformer architecture; (iv) challenges in finding the right mapping and scheduling of operations for Transformer models; and (v) approaches for optimizing Transformer models by adapting the architecture using neural architecture search. Finally, we perform a case study by applying the surveyed optimizations on Gemmini, the open-source, full-stack DNN accelerator generator, and we show how each of these approaches can yield improvements compared to previous benchmark results on Gemmini. Among other things, we find that a full-stack co-design approach with the aforementioned methods can result in up to 88.7x speedup with a minimal performance degradation for Transformer inference.
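    The survey's point about non-linear operations can be made concrete with a short NumPy sketch (standard textbook definitions, not code from the survey): unlike the matmuls that dominate FLOPs, Softmax and LayerNorm each need full row-wise reductions (max/sum, mean/variance) before any output element can be produced, which serializes the dataflow and typically demands higher-precision accumulation on accelerators.

    import numpy as np

    def softmax(x, axis=-1):
        z = x - x.max(axis=axis, keepdims=True)     # row-wise max: first reduction pass
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)  # row-wise sum: second reduction pass

    def layer_norm(x, gamma, beta, eps=1e-5):
        mu = x.mean(axis=-1, keepdims=True)         # reduction pass 1: mean
        var = x.var(axis=-1, keepdims=True)         # reduction pass 2: variance
        return gamma * (x - mu) / np.sqrt(var + eps) + beta

    def gelu(x):
        # tanh approximation often preferred in hardware-friendly kernels
        return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))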

    IceHEP: High Energy Physics at the South Pole

    With the solar and SN 1987A neutrino observations as proofs of concept, the kilometer-scale neutrino experiment IceCube will scrutinize its data for new particle physics. In this paper we review the prospects for the realization of such a program. We begin with a short overview of the detector response and discuss the reach of the "beam" luminosity. After that we discuss the potential of IceCube to probe deviations of neutrino-nucleon cross sections from the Standard Model predictions at center-of-mass energies well beyond those accessible in man-made accelerators. Then we review the prospects for extremely long-baseline analyses and discuss the sensitivity to tiny deviations of the flavor mixing angle, expected to be induced by quantum gravity effects. Finally we discuss the potential to uncover annihilation of dark matter particles gravitationally trapped at the center of the Sun, as well as processes occurring in the early Universe at energies close to the Grand Unification scale.
    Comment: Typos corrected and references added. Version with high resolution figures available at http://www.hep.physics.neu.edu/staff/doqui/icehep_rev6.p

    Light Higgsino from Axion Dark Radiation

    Recent observations imply that there is an extra relativistic degree of freedom, coined dark radiation. We argue that the QCD axion is a plausible candidate for the dark radiation, not only because of its extremely small mass, but also because in the supersymmetric extension of the Peccei-Quinn mechanism the saxion tends to dominate the Universe and decays into axions with a sizable branching fraction. We show that the Higgsino mixing parameter mu is bounded from above when the axions produced in saxion decays constitute the dark radiation: mu ≲ 300 GeV for a saxion lighter than 2m_W, and mu less than the saxion mass otherwise. Interestingly, the Higgsino can be light enough to be within the reach of the LHC and/or ILC even when the other superparticles are heavy, with masses of about 1 TeV or higher. We also estimate the abundance of axinos produced by the decays of the Higgsino and the saxion.
    Comment: 18 pages, 1 figure; published in JHEP

    The organisation and delivery of health improvement in general practice and primary care: a scoping study

    Background: This project examines the organisation and delivery of health improvement activities by and within general practice and the primary health-care team. The project was designed to examine who delivers these interventions, where they are located, what approaches are developed in practices, how individual practices and the primary health-care team organise such public health activities, and how these contribute to health improvement. Our focus was on health promotion and ill-health prevention activities.
    Aims: The aim of this scoping exercise was to identify the current extent of knowledge about health improvement activities in general practice and the wider primary health-care team. The key objectives were to provide an overview of the range and type of health improvement activities and to identify gaps in knowledge and areas for further empirical research. Our specific research objectives were: to map the range and type of health improvement activity undertaken by general practice staff and the primary health-care team based within general practice; to scope the literature on health improvement in general practice or undertaken by health-care staff based in general practice and identify gaps in the evidence base; to synthesise the literature and identify effective approaches to the delivery and organisation of health improvement interventions in a general practice setting; and to identify the priority areas for research as defined by those working in general practice.
    Methods: We undertook a comprehensive search of the literature, following a staged selection process involving reviews of titles and abstracts. This resulted in the identification of 1140 papers for data extraction, with 658 of these papers selected for inclusion in the review, of which 347 were included in the evidence synthesis. We also undertook 45 individual interviews and two group interviews with primary health-care staff.
    Findings: Many of the research studies reviewed gave some details about the type, process or location of the intervention, or who provided it. Generally, however, little attention is paid in the literature to examining the impact of the organisational context on the way services are delivered or how this affects the effectiveness of health improvement interventions in general practice. We found that the focus of attention is mainly on individual prevention approaches, with practices engaging in both primary and secondary prevention. The range of activities suggests that general practitioners do not take a population approach but focus on individual patients. However, it is clear that many general practitioners see health promotion as an integral part of practice, whether as individual approaches to primary or secondary health improvement or as a practice-based approach to improving the health of their patients. Our key conclusion is that there is currently insufficient good evidence to support many of the health improvement interventions undertaken in general practice and primary care more widely.
    Future research: Future research on health improvement in general practice and by the primary health-care team needs to move beyond clinical research to include delivery systems and be conducted in a primary care setting. More research needs to examine areas with chronic disease burdens, such as cancer, dementia and other disabilities of old age. Reviews should be commissioned that examine the whole prevention pathway for health problems that are managed within primary care, drawing together research from general practice, pharmacy, community engagement, etc.

    Decoherence and CPT Violation in a Stringy Model of Space-Time Foam

    I discuss a model inspired by the string/brane framework, in which our Universe is represented as a three-brane propagating in a bulk space-time punctured by D0-brane (D-particle) defects. As the D3-brane world moves in the bulk, the D-particles cross it, and to an observer on the D3-brane the situation looks like a "space-time foam" with the defects "flashing" on and off ("D-particle foam"). The open strings, with their ends attached to the brane, which represent matter in this scenario, can interact with the D-particles on the D3-brane universe in a topologically non-trivial manner, involving splitting and capture of the strings by the D0-brane defects. Such processes are described by logarithmic conformal field theories on the world-sheet. Physically, they result in effective decoherence of the string matter on the D3-brane and, as a result, in CPT violation, but of a type that implies an ill-defined nature of the effective CPT operator. Due to electric charge conservation, only electrically neutral (string) matter can exhibit such interactions with the D-particle foam. This may have unique, experimentally detectable consequences for electrically neutral entangled quantum matter states on the brane world, in particular a modification of the pertinent EPR correlations of neutral mesons in a meson factory.
    Comment: 41 pages Latex, five eps figures incorporated. Uses special macro